title: “Analyzing Financial Data to Predict Net Income of a Building Services Company (2019–2023)” author: “Shelley Wright” date: “2025-04-26” format: html: self-contained: true toc: true warning: false number-sections: true code-fold: true code-tools: true
execute: echo: true
—
1 Introduction
Predicting net income accurately is vital for operational planning and strategic decision-making in business industries. Reliable forecasts enable organizations to allocate resources effectively, plan for growth, and mitigate potential financial risks. This study analyzes detailed financial data collected from a building services company spanning the years 2019 to 2023, with the primary goal of modeling and predicting net income based on a variety of operational factors. Key predictor variables examined include service class, fiscal year, expenditures on cleaning supplies, purchases of paper goods, and payroll expenses.
To identify the most suitable predictive framework, multiple linear regression models were constructed and compared, assessing model performance based on fit statistics and diagnostic measures. During the course of the analysis, log transformations were applied to correct for skewed distributions and improve linearity, while robust regression techniques were employed to better handle the presence of outliers and heteroscedasticity that could otherwise distort model estimates. These methodological adjustments led to improved model performance; however, residual diagnostics suggested that additional refinements could further enhance predictive accuracy. Overall, this study highlights both the challenges and opportunities involved in financial modeling for service-based industries and lays the groundwork for future enhancements using more complex modeling approaches or expanded datasets.
2 Setup
3 Data Cleaning and Preparation
The dataset underwent extensive cleaning procedures to ensure accuracy, consistency, and readiness for statistical analysis. Initial steps involved correcting data inconsistencies, standardizing formats, and addressing missing values to maintain the integrity of the analysis. Financial data was systematically exported into Excel files, organized separately by year to facilitate easier handling and year-over-year comparisons. Each exported file contained comprehensive financial records for all accounts managed by the company during that period.
To streamline the analysis and maintain confidentiality, account names were removed and each account was categorized into one of four broad classes: commercial, industrial, medical, or university-affiliated. This classification allowed for clearer segmentation of financial patterns across different types of service clients. A categorical variable representing the year was then added to each dataset to preserve the time dimension essential for longitudinal analysis. Given the original format of the exported files, each dataset required transposition so that financial measures were properly aligned with their corresponding accounts and years.
Following these transformations, the individual yearly datasets were carefully merged into a single, unified dataset to serve as the foundation for the predictive modeling process. This merging process was performed with close attention to maintaining consistency across variables and ensuring that no financial records were inadvertently lost or duplicated. The resulting dataset provided a clean, structured, and comprehensive resource suitable for in-depth regression modeling and further statistical exploration.
4 Demographics
To gain a deeper understanding of the underlying characteristics within the dataset, demographic summaries were created based on key financial and operational variables. In this context, “demographics” refers to the classification of accounts according to factors such as service class (commercial, industrial, medical, or university-affiliated) and fiscal year of service. These groupings allowed for the identification of patterns and differences across various types of accounts, providing essential context for interpreting financial trends over time. By summarizing the distribution of accounts across classes and years, it became possible to observe which sectors contributed most significantly to net income and how these contributions shifted annually. These demographic breakdowns were critical in informing the selection and construction of the regression models, ensuring that the models captured not just overall financial performance, but also the unique contributions of different account types and operational years to the company’s financial outcomes.
The pie chart titled “Number of Different Accounts from 2019–2023” provides a visual summary of the distribution of account types within the company’s financial data. It reveals that commercial accounts make up the majority of the company’s business, representing 279 accounts, followed by industrial (81 accounts), medical (62 accounts), and university-affiliated accounts (44 accounts). This breakdown highlights the company’s heavy reliance on the commercial sector for its financial stability. Understanding the relative proportions of account types is important because it informs the regression modeling process — specifically, it ensures that variations in net income are interpreted within the context of the underlying client demographics. Differences in service class may drive variability in financial performance, making this demographic breakdown a critical component of the overall analysis.
Code
# Load the necessary librarylibrary(plotrix)# Set the background color to light bluepar(bg ="lightblue")# Create the class counts tableclass_counts <-table(PandL$Class)# Define custom labelslabels <-c("Commercial", "Industrial", "Medical", "University")# Create labels with counts includedlabels_with_counts <-paste0(labels, "\n(", class_counts, ")")# Create the exploding 3D pie chartpie3D( class_counts,labels = labels_with_counts,explode =0.3,height =0.2,main ="Number of Different Accounts from 2019–2023",labelcex =1.1,col =rainbow(length(class_counts)))
The bar graph titled “Number of Accounts by Year” displays the distribution of accounts from 2019 to 2023. Each bar represents the total number of accounts recorded in a given year, with the exact count labeled above each bar for clarity. The number of accounts remained relatively consistent over the five-year period, ranging from 90 accounts in 2019 to 98 accounts in 2020, which marked the peak. Subsequent years showed minimal variation, with 91 accounts in 2021, 94 in 2022, and 93 in 2023.
This stability in account numbers across years is important for the modeling process, as it suggests that there were no major fluctuations in client volume that could introduce bias or disproportionately influence the regression analysis. Consistent sample sizes across years support more reliable year-to-year comparisons and help ensure that any trends detected in net income are more likely attributable to operational or financial variables rather than shifts in the size of the client base.
Code
# Summarize to get countsaccounts_per_year <- PandL %>%group_by(Year) %>%summarise(account_count =n()) %>%arrange(Year)# Create the bar plotggplot(accounts_per_year, aes(x =factor(Year), y = account_count)) +geom_col(fill ="steelblue", color ="black", width =0.7) +geom_text(aes(label = account_count), vjust =-0.5, size =5, fontface ="bold") +labs(title ="Number of Accounts by Year",x ="Year",y ="Number of Accounts" ) +ylim(0, 110) +theme_minimal(base_size =14) +theme(plot.title =element_text(hjust =0.5, face ="bold"),axis.text.x =element_text(angle =0, vjust =0.5),panel.grid.major.y =element_line(color ="gray80"),panel.grid.minor =element_blank(),plot.background =element_rect(fill ="lightyellow", color =NA),panel.background =element_rect(fill ="lightyellow", color =NA) )
Figure 1: Number of Accounts by Year (2019–2023)
The line graph titled “Total Net Income by Year” illustrates the overall net income generated by the company from 2019 to 2023. Net income values are plotted for each year, with specific dollar amounts labeled at each point and a dollar sign symbol marking each year’s income. The y-axis represents Net Income ($), while the x-axis displays the years in chronological order.
The graph shows a clear upward trend from 2019 to 2020, where net income increased significantly from $921,215 to a peak of $1,326,715. After reaching this peak in 2020, net income steadily declined over the following years, dropping to $1,169,509 in 2021, $1,052,897 in 2022, and $1,052,813 in 2023. While the rate of decline slowed between 2022 and 2023, the overall trend after 2020 indicates a decrease in profitability.
In the context of this project, this trend is crucial for understanding financial dynamics over time and emphasizes the need for predictive modeling. The sharp increase in 2020 followed by a gradual decline suggests that specific operational factors—such as changes in payroll expenses, supply costs, or shifts in account types—may have impacted net income. By incorporating variables like year, class, payroll expenses, cleaning supplies, and paper goods into regression models, the analysis aims to identify which factors most strongly influenced these income patterns. This understanding will not only enhance prediction accuracy but also provide insights for future financial planning and decision-making within the company.
Code
####Line graph for net income by year ###### Summarize Net Income by Yearnet_income_by_year <- PandL %>%group_by(Year) %>%summarise(Net_Income =sum(Net.Income, na.rm =TRUE)) %>%arrange(Year)# Calculate max y for paddingmax_income <-max(net_income_by_year$Net_Income, na.rm =TRUE)y_limit <- max_income *1.25# Add 25% padding# Fancy Line Chart with Dollar Signs as Pointsggplot(net_income_by_year, aes(x = Year, y = Net_Income)) +geom_line(color ="steelblue", linewidth =1.5) +# Dollar sign symbols as data pointsgeom_text(aes(label ="$"),size =8, # Adjust size as neededvjust =0.5,color ="darkgreen",fontface ="bold" ) +# Actual Net Income values above each pointgeom_text(aes(label =comma(round(Net_Income))),vjust =-2.5,size =3.5,fontface ="bold" ) +labs(title ="Total Net Income by Year",x ="Year",y ="Net Income ($)" ) +scale_y_continuous(labels =dollar_format(),limits =c(500000, y_limit) ) +theme_minimal(base_size =14) +theme(plot.title =element_text(hjust =0.5, face ="bold"),panel.background =element_rect(fill ="lightblue", color =NA),plot.background =element_rect(fill ="lightgrey", color =NA) )
Figure 2: Total Net Income by Year (2019–2023)
The figure below shows the trend in Cleaning, Paper, and Payroll expenses from 2019 to 2023. Payroll expenses were consistently the highest, increasing steadily over the period. Cleaning expenses peaked in 2020 before declining and stabilizing, while Paper expenses remained relatively low and stable throughout the study period. Payroll expenses were consistently the highest, increasing steadily over the period. Cleaning expenses peaked in 2020 before declining and stabilizing, while Paper expenses remained relatively low and stable throughout the analysis.
Code
#| label: fig-expenses-over-time#| fig-cap: "Expenses by Category Over Time (2019–2023)"#| echo: true#| code-fold: trueggplot(expense_long, aes(x = Year, y = Amount, color = Expense_Type)) +geom_line(size =1.5) +geom_point(size =3) +geom_text(aes(label =comma(round(Amount))),vjust =-0.7,size =2.5,fontface ="bold",show.legend =FALSE ) +scale_y_continuous(labels =dollar_format(),breaks =seq(0, 1600000, by =250000) ) +labs(title ="Expenses by Category Over Time",x ="Year",y ="Total Expense ($)",color ="Expense Type" ) +theme_minimal(base_size =14) +theme(plot.title =element_text(hjust =0.5, face ="bold"),legend.position ="top",panel.background =element_rect(fill ="lightyellow", color =NA),plot.background =element_rect(fill ="lightyellow", color =NA),plot.margin =margin(5,5,5,5) # more compact )
The final demographic illustration of demographics is a 3D scatter plot titled “Predicted Net Income by Payroll and Year” illustrates the relationship between payroll expenses, year, and net income for the building services company from 2019 to 2023. Payroll expenses are plotted on the x-axis, years are ordered along the y-axis, and net income is represented on the z-axis. Each point reflects an account in a given year, with colors distinguishing between years and lines connecting observations within the same year. The plot shows a generally positive relationship between payroll expenses and net income, although the trend is not perfectly consistent across years. Net income peaked in 2020, followed by a gradual flattening in subsequent years despite continued increases in payroll expenses. This visualization highlights the importance of including both payroll and year in the predictive models and suggests the need for a flexible modeling approach to account for variability over time.
Code
##### 3D Plot of Net Income by Year by Payroll ####library(plotly)PandL$Year <-as.factor(PandL$Year)PandL$YearNum <-as.numeric(as.character(PandL$Year)) # Needed for axis orderingplot_ly( PandL, x =~Payroll.Expenses, y =~YearNum, z =~Net.Income, color =~Year,colors =c("cyan", "orange", "pink", "yellow", "purple"),type ="scatter3d", mode ="lines+markers",marker =list(size =8,line =list(color ='black', width =2) ),line =list(width =5),width =900,height =600) %>%layout(margin =list(l =60, r =60, b =50, t =50), # Tighter top marginscene =list(camera =list(eye =list(x =1.6, y =-1.6, z =1.2)), # Nicely centered viewxaxis =list(title =list(text ="Payroll Expenses", font =list(size =14)),tickfont =list(size =11),gridcolor ='grey50', # Darker grid lineszerolinecolor ='grey50'# Darker zero line ),yaxis =list(title =list(text ="Year", font =list(size =14)),tickvals =c(2019, 2020, 2021, 2022, 2023),ticktext =c("2019", "2020", "2021", "2022", "2023"),tickfont =list(size =11),gridcolor ='grey50', # Darker grid lineszerolinecolor ='grey50' ),zaxis =list(title =list(text ="Net Income", font =list(size =14)),tickfont =list(size =11),gridcolor ='grey50', # Darker grid lineszerolinecolor ='grey50' ) ),title =list(text ="Predicted Net Income by Payroll and Year",font =list(size =22),x =0.5,xanchor ="center",yanchor ="top" ),plot_bgcolor ='#e6f7ff', # Light blue inside plotpaper_bgcolor ='#e6f7ff'# Light blue outside plot )
Predicted Net Income by Payroll and Year
5 Methods
5.1 Initial Regression Model
A multiple linear regression was performed to predict net income based on year, payroll expenses, paper goods expenses, and cleaning supplies expenses. The model was statistically significant overall (F(7, 458) = 273.7, p < 2.2e-16), indicating that, together, the predictors explain a substantial portion of the variance in net income. The Multiple R-squared value was 0.8071, and the Adjusted R-squared was 0.8041, suggesting that approximately 80% of the variability in net income is explained by the included predictors, a very strong model fit for financial data.
Examining the individual predictors, payroll expenses had a highly significant positive association with net income (β = 0.945, p < 0.001), meaning that as payroll expenses increase, net income tends to increase proportionally. Similarly, paper goods expenses (β = 0.735, p < 0.001) and cleaning supplies expenses (β = 0.530, p < 0.001) were also significantly positively associated with net income, though with slightly smaller effects compared to payroll.
The year variable showed mixed results. Compared to the baseline year (2019), only 2020 showed a statistically significant increase in net income (β = 3,154, p = 0.028), while differences for 2021, 2022, and 2023 were not statistically significant, suggesting that after the sharp rise in 2020, year-to-year changes were less influential once operational costs were accounted for.
The residuals of the model had a standard error of 9,747, with residuals distributed fairly symmetrically around zero, though a few larger deviations indicate the presence of some variation not fully explained by the model.
Overall, the model demonstrates that operational expenses — particularly payroll — are strong predictors of net income in this dataset, with year effects being most notable in 2020.
Code
model <-lm(Net.Income ~ Year + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)summary(model)
Call:
lm(formula = Net.Income ~ Year + Payroll.Expenses + Paper.Goods +
Cleaning.Supplies, data = PandL)
Residuals:
Min 1Q Median 3Q Max
-40603 -2455 1093 3461 64063
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.413e+02 1.063e+03 -0.886 0.376140
Year2020 3.154e+03 1.431e+03 2.204 0.028045 *
Year2021 1.220e+02 1.451e+03 0.084 0.933037
Year2022 -1.461e+03 1.440e+03 -1.014 0.311022
Year2023 -1.487e+03 1.444e+03 -1.030 0.303784
Payroll.Expenses 9.451e-01 2.981e-02 31.703 < 2e-16 ***
Paper.Goods 7.349e-01 1.592e-01 4.616 5.09e-06 ***
Cleaning.Supplies 5.298e-01 1.566e-01 3.383 0.000779 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9747 on 458 degrees of freedom
Multiple R-squared: 0.8071, Adjusted R-squared: 0.8041
F-statistic: 273.7 on 7 and 458 DF, p-value: < 2.2e-16
Code
# Set up 2x2 plot gridpar(mfrow =c(2, 2))# Plot standard diagnostic plotsplot(model)
Figure 3: Residual Diagnostics for Regression Model
6 Model Diagnostics
Residual analysis revealed several key diagnostic issues with the initial linear regression model. The Residuals vs Fitted plot showed that most points were centered around zero, which is desirable; however, other diagnostic plots indicated violations of core regression assumptions. The Q-Q plot exhibited heavy deviations in the tails, suggesting that the residuals were not normally distributed. The Scale-Location plot displayed an upward trend, indicating the presence of heteroscedasticity, where the variance of residuals increases with fitted values. Additionally, the Residuals vs Leverage plot identified a few high-leverage points that could disproportionately influence the model estimates. Collectively, these diagnostic findings highlighted instability in the residuals and significant departures from ideal regression conditions. As a result, further modeling adjustments were pursued to address these issues and improve the overall fit and robustness of the predictive model.
6.0.1 Transformation
A log transformation was applied to net income to address skewness, reduce the influence of extreme values, and improve adherence to linear regression assumptions of normality and homoscedasticity.
Code
### Regression Model 2: Log Net Income Predictors#| label: tbl-model2-summary#| tbl-cap: "Summary of Linear Model 2: Predictors of Log Net Income"#| echo: true#| code-fold: true#| warning: false PandL <- PandL %>%filter(Net.Income >0)# Fit the modelmodel2 <-lm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)# Display model summarysummary(model2)
Call:
lm(formula = log(Net.Income) ~ Year + Class + Payroll.Expenses +
Paper.Goods + Cleaning.Supplies, data = PandL)
Residuals:
Min 1Q Median 3Q Max
-3.6344 -0.7826 0.3639 0.9124 2.0505
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.804e+00 1.690e-01 46.174 < 2e-16 ***
Year2020 -2.420e-02 1.842e-01 -0.131 0.8955
Year2021 -1.929e-01 1.873e-01 -1.030 0.3036
Year2022 -3.999e-01 1.845e-01 -2.167 0.0308 *
Year2023 -3.688e-01 1.850e-01 -1.993 0.0468 *
Class 6.628e-02 6.144e-02 1.079 0.2813
Payroll.Expenses 4.361e-05 3.761e-06 11.596 < 2e-16 ***
Paper.Goods 8.909e-05 2.006e-05 4.441 1.13e-05 ***
Cleaning.Supplies 4.625e-05 2.081e-05 2.222 0.0268 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.228 on 442 degrees of freedom
Multiple R-squared: 0.4367, Adjusted R-squared: 0.4265
F-statistic: 42.84 on 8 and 442 DF, p-value: < 2.2e-16
Code
par(mfrow =c(2, 2))plot(model2)
A multiple linear regression was performed to predict log-transformed net income using year, service class, payroll expenses, paper goods, and cleaning supplies. The model was statistically significant (F(8, 442) = 42.84, p < 2.2e-16) and explained approximately 43% of the variance in log net income (Adjusted R² = 0.4265). Payroll expenses remained a strong positive predictor (p < 0.001), while paper goods and cleaning supplies also showed significant positive effects. Net income was significantly lower in 2022 and 2023 compared to 2019 after adjusting for expenses. Service class was not a significant predictor. Although fewer observations were used due to missingness after log transformation, the model improved in meeting key regression assumptions.
The diagnostic plots for the log-transformed regression model reveal both improvements and ongoing limitations when compared to the original OLS model without transformation. In the Residuals vs Fitted plot, the residuals show less dramatic fanning and fewer extreme deviations compared to the original model, suggesting that the log transformation helped reduce heteroscedasticity. However, there is still some curvature and spread in the residuals, indicating that non-linearity and non-constant variance have not been entirely eliminated. In the Q-Q plot, the residuals now align much more closely with the theoretical normal line than in the original OLS model, particularly in the center of the distribution, although some deviation persists in the tails. This suggests improved but not perfect normality of residuals after transformation.
The Scale-Location plot shows that variability of the residuals across fitted values is more stabilized than before, but some increasing spread remains, especially at higher fitted values. Finally, the Residuals vs Leverage plot indicates that most points have low leverage, but a few observations still exert some influence on the model, which could continue to affect model stability.
Overall, compared to the original OLS model without transformation, the log-transformed model exhibits better adherence to linear regression assumptions, particularly in terms of normality and variance homogeneity. However, some minor issues with non-linearity, residual spread, and influential points remain, suggesting that further refinement—such as moving to a robust regression model—may provide additional improvements by reducing sensitivity to these outliers.
7 Robust Regression Analysis
Given persistent issues, a Robust Regression was employed to reduce the influence of outliers. Robust regression was chosen because the outliers are distorting the ordinary least square estimates, the residuals are non-normally distributed and the residual plots still indicate heteroscedasticity. The robust model used M-estimation to downweight extreme residuals.
Code
### Robust Regression Model: Log Net Income Predictors#| label: tbl-robust-model-summary#| tbl-cap: "Summary of Robust Regression Model for Log Net Income"#| echo: true#| code-fold: true#| warning: false #| message: falselibrary(MASS)PandL <- PandL %>%filter(Net.Income >0)# Fit the robust regression modelrobust_model <-rlm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)# Display summary of the robust modelsummary(robust_model)
Call: rlm(formula = log(Net.Income) ~ Year + Class + Payroll.Expenses +
Paper.Goods + Cleaning.Supplies, data = PandL)
Residuals:
Min 1Q Median 3Q Max
-3.8184 -0.8593 0.2396 0.8067 1.9650
Coefficients:
Value Std. Error t value
(Intercept) 7.9230 0.1623 48.8021
Year2020 0.0734 0.1769 0.4150
Year2021 -0.1213 0.1799 -0.6742
Year2022 -0.3536 0.1773 -1.9948
Year2023 -0.3332 0.1777 -1.8751
Class 0.0338 0.0590 0.5725
Payroll.Expenses 0.0000 0.0000 12.1309
Paper.Goods 0.0001 0.0000 4.4724
Cleaning.Supplies 0.0000 0.0000 2.2124
Residual standard error: 1.222 on 442 degrees of freedom
Code
#| label: fig-robust-residual-diagnostics#| fig-cap: "Residual Diagnostics for Robust Regression Model"#| echo: true#| code-fold: true#| warning: false # Set 2x2 plotting areapar(mfrow =c(2, 2))# Plot residuals vs fitted valuesplot(fitted(robust_model), residuals(robust_model),main ="Residuals vs Fitted",xlab ="Fitted Values",ylab ="Residuals",pch =20, col ="blue")abline(h =0, col ="red", lty =2)# Normal Q-Q plotqqnorm(residuals(robust_model),main ="Normal Q-Q Plot",pch =20, col ="blue")qqline(residuals(robust_model), col ="red", lty =2)# Scale-Location plotplot(fitted(robust_model), sqrt(abs(residuals(robust_model))),main ="Scale-Location",xlab ="Fitted Values",ylab ="Sqrt(|Residuals|)",pch =20, col ="blue")# Residuals vs Leverage plot (approximate leverage from lm model)plot(hatvalues(model2), residuals(robust_model),main ="Residuals vs Leverage (Approximate)",xlab ="Leverage (from lm model)",ylab ="Residuals",pch =20, col ="blue")
The model estimates indicate that payroll expenses remained a highly significant and strong positive predictor of net income (t = 12.13), similar to the findings from previous OLS models. Paper goods expenses (t = 4.47) and cleaning supplies expenses (t = 2.21) also maintained significant positive relationships. Regarding time effects, 2022 showed a statistically significant decline in net income compared to 2019 (t = -1.99), and 2023 showed a similar negative trend (t = -1.88), although slightly less significant. The year 2020 and 2021 did not show meaningful differences from the baseline year. Service class again was not statistically significant, reinforcing earlier findings that type of account was not a major driver of net income variation once operational expenses were considered.
The residual standard error for the robust model was 1.222, very close to the residual standard error observed in the log-transformed OLS model (1.228), suggesting that the robust model achieved similar fit but with the advantage of being less sensitive to influential outliers. Fifteen observations were still excluded due to missingness (likely due to log(0) issues), consistent with previous analyses.
Overall, the robust regression confirms the key conclusions of the log-transformed OLS model — payroll expenses, paper goods, and cleaning supplies are strong predictors of net income, and net income showed a modest decline in the later years of the study. However, by reducing the influence of outliers, the robust model provides greater confidence in these estimates, especially in a financial dataset where a few extreme accounts could otherwise distort results.
The diagnostic plots from the robust regression model show further improvements compared to the original OLS and log-transformed OLS models. In the Residuals vs Fitted plot, the residuals are more symmetrically distributed around zero, although some slight curvature and variability remain at lower fitted values. Compared to the earlier OLS plots, the robust regression has reduced the influence of extreme residuals, making the pattern less pronounced and more consistent.
The Normal Q-Q plot shows some improvement in normality relative to previous models. While minor deviations from the theoretical line persist, particularly in the tails, the central portion of the distribution aligns more closely with normal expectations. This suggests that while the residuals are not perfectly normal, they are much closer to normality under the robust modeling approach.
In the Scale-Location plot, the spread of residuals across fitted values appears more stabilized than in the earlier models. Most points cluster at lower fitted values, and while some heteroscedasticity remains, particularly at the lower end, it is less extreme than previously observed. The overall pattern suggests an improvement in variance homogeneity.
The Residuals vs Leverage plot indicates that, similar to previous models, most points have low leverage, but a small number of observations still show higher leverage. However, the influence of these points appears less severe, reflecting the robust regression’s ability to minimize the impact of high-leverage outliers on the model’s coefficients.
Overall, compared to both the initial OLS model and the log-transformed OLS model, the robust regression diagnostics demonstrate better performance in terms of residual symmetry, variance stabilization, and reduced sensitivity to influential points. Although minor issues still remain, the robust model represents a meaningful refinement, offering more reliable and stable coefficient estimates for interpreting the financial relationships within the dataset.
Code
# Load librarieslibrary(dplyr)library(ggplot2)library(MASS)PandL <- PandL %>%filter(Net.Income >0)# Clean the dataset: drop NAs and filter positive Net IncomePandL_clean <- PandL %>%drop_na(Net.Income, Payroll.Expenses, Paper.Goods, Cleaning.Supplies, Year, Class) %>%filter(Net.Income >0) # Only keep rows where Net Income is positive# Fit the robust modelrobust_model <-rlm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies,data = PandL_clean)# Create a dataset with actual and fitted valuesPandL_filtered <- PandL_clean %>%mutate(actual_log_income =log(Net.Income),fitted_robust =predict(robust_model, newdata = PandL_clean) )# Plotggplot(PandL_filtered, aes(x = Payroll.Expenses, y = actual_log_income)) +geom_point(alpha =0.7, color ="steelblue", size =2) +geom_line(aes(y = fitted_robust), color ="firebrick", linewidth =2) +labs(title ="Predicted vs Actual (Robust Regression Model)",x ="Payroll Expenses",y ="Log of Net Income" ) +theme_minimal(base_size =14) +theme(panel.background =element_rect(fill ="grey90", color =NA),plot.background =element_rect(fill ="grey95", color =NA),panel.grid.major =element_line(color ="grey70"),panel.grid.minor =element_line(color ="grey85"),axis.line =element_line(color ="black"),plot.title =element_text(face ="bold", hjust =0.5) )
The plot demonstrates a clear positive association between Payroll Expenses and Net Income on a logarithmic scale. As payroll spending increases, companies generally achieve higher levels of profitability, suggesting that investment in human resources and labor costs may be strongly tied to financial performance. The relationship is more variable at lower payroll levels, indicating that smaller companies or organizations with modest payrolls experience a wider range of financial outcomes. In contrast, among organizations with higher payroll expenses, the relationship between payroll and profitability becomes more stable and predictable. The use of a robust regression model strengthens this conclusion by reducing the influence of extreme outliers, allowing the central trend to emerge more clearly.
8 Results
Code
### Comparison of Regression Models#| label: tbl-model-LS, Log-Transformed OLS, and Robust Regression Models Predicting Net Income"#| echo: true#| code-fold: true#| warning: false #| message: false# Load necessary librarieslibrary(MASS)library(modelsummary)# Load your datasetPandL <-read.csv("PandL.csv") # <-- adjust path if neededPandL <- PandL %>%filter(Net.Income >0)# Fit the modelsmodel1 <-lm(Net.Income ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)model2 <-lm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)robust_model <-rlm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)# Create a list of modelsmodels <-list("Model 1: OLS (Net Income)"= model1,"Model 2: Log-Transformed OLS"= model2,"Model 3: Robust Regression"= robust_model)# Create the model comparison tablemodelsummary( models,stars =TRUE,statistic ="std.error",gof_omit ="IC|Log.Lik.",title ="Comparison of Regression Models for Net Income")
Comparison of Regression Models for Net Income
Model 1: OLS (Net Income)
Model 2: Log-Transformed OLS
Model 3: Robust Regression
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept)
1507267.971*
233.116**
227.491**
(633361.468)
(83.327)
(80.257)
Year
-743.999*
-0.112**
-0.109**
(313.399)
(0.041)
(0.040)
Class
-2669.616***
0.064
0.033
(464.350)
(0.061)
(0.059)
Payroll.Expenses
0.944***
0.000***
0.000***
(0.028)
(0.000)
(0.000)
Paper.Goods
0.695***
0.000***
0.000***
(0.152)
(0.000)
(0.000)
Cleaning.Supplies
0.798***
0.000*
0.000*
(0.156)
(0.000)
(0.000)
Num.Obs.
451
451
451
R2
0.825
0.436
R2 Adj.
0.823
0.429
F
419.545
68.676
71.729
RMSE
9249.44
1.22
1.22
Three regression models were analyzed to predict Net Income. The initial model using untransformed Net Income showed significant effects for Year, Class, and Payroll Expenses, although the coefficients were large and difficult to interpret directly due to the scale of the outcome. The second model, with log-transformed Net Income, improved interpretability and revealed that Year and Payroll Expenses remained statistically significant predictors, while the effect of Class was no longer significant. Finally, the robust regression model confirmed the findings from the log-transformed model, with Year and Payroll Expenses showing consistent significance and effect sizes, while controlling for potential outliers. Overall, the log-transformed models provided more stable and interpretable results, suggesting that transformations and robust methods improved the reliability of the model estimates.
While model performance improved with log transformation and robust techniques, residual diagnostics indicated areas needing further refinement. Future work may incorporate non-linear methods or variable interaction terms.
9 Conclusion
Multiple modeling strategies were explored to predict Net Income, including the application of variable transformations and robust regression techniques. Transforming Net Income using a logarithmic scale helped address issues of skewness and heteroskedasticity, leading to improved linearity and model fit. In addition, robust regression was employed to minimize the influence of outliers, further strengthening the stability and interpretability of the model. Although these adjustments improved performance, residual analysis revealed that important patterns of variability and potential nonlinearity remained unaccounted for. These persistent issues suggest that additional factors beyond those currently included may play a significant role in influencing Net Income. Future research should focus on incorporating additional predictors, exploring potential interaction effects, and considering more flexible modeling approaches, such as generalized additive models or mixed-effects models, to better capture the complexity of the data.
10 References
Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models. Sage Publications.
Huber, P. J. (1981). Robust Statistics. Wiley.
10.1 Quarto
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
10.2 Running Code
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
Code
1+1
[1] 2
You can add options to executable code like this
[1] 4
The echo: false option disables the printing of code (only output is displayed).
Source Code
------title: "Analyzing Financial Data to Predict Net Income of a Building Services Company (2019–2023)"author: "Shelley Wright"date: "2025-04-26"format: html: self-contained: true toc: true warning: false number-sections: true code-fold: true code-tools: true execute: echo: true ---``` ```# IntroductionPredicting net income accurately is vital for operational planning and strategic decision-making in business industries. Reliable forecasts enable organizations to allocate resources effectively, plan for growth, and mitigate potential financial risks. This study analyzes detailed financial data collected from a building services company spanning the years 2019 to 2023, with the primary goal of modeling and predicting net income based on a variety of operational factors. Key predictor variables examined include service class, fiscal year, expenditures on cleaning supplies, purchases of paper goods, and payroll expenses.To identify the most suitable predictive framework, multiple linear regression models were constructed and compared, assessing model performance based on fit statistics and diagnostic measures. During the course of the analysis, log transformations were applied to correct for skewed distributions and improve linearity, while robust regression techniques were employed to better handle the presence of outliers and heteroscedasticity that could otherwise distort model estimates. These methodological adjustments led to improved model performance; however, residual diagnostics suggested that additional refinements could further enhance predictive accuracy. Overall, this study highlights both the challenges and opportunities involved in financial modeling for service-based industries and lays the groundwork for future enhancements using more complex modeling approaches or expanded datasets.# Setup```{r setup, include=FALSE}# Load librarieslibrary(plotrix)library(ggplot2)library(dplyr)library(gt)library(tidyr)library(scales)library(plotly)# Now load the real dataset (not faking anymore)PandL <-read.csv("PandL.csv")#| label: tbl-expense-long#| echo: true#| code-fold: true# Create summarized expense_long datasetexpense_long <- PandL %>%group_by(Year) %>%summarise(Cleaning =sum(Cleaning.Supplies, na.rm =TRUE),Paper =sum(Paper.Goods, na.rm =TRUE),Payroll =sum(Payroll.Expenses, na.rm =TRUE),.groups ="drop" ) %>%pivot_longer(cols =c(Cleaning, Paper, Payroll),names_to ="Expense_Type",values_to ="Amount" )```# Data Cleaning and PreparationThe dataset underwent extensive cleaning procedures to ensure accuracy, consistency, and readiness for statistical analysis. Initial steps involved correcting data inconsistencies, standardizing formats, and addressing missing values to maintain the integrity of the analysis. Financial data was systematically exported into Excel files, organized separately by year to facilitate easier handling and year-over-year comparisons. Each exported file contained comprehensive financial records for all accounts managed by the company during that period.To streamline the analysis and maintain confidentiality, account names were removed and each account was categorized into one of four broad classes: commercial, industrial, medical, or university-affiliated. This classification allowed for clearer segmentation of financial patterns across different types of service clients. A categorical variable representing the year was then added to each dataset to preserve the time dimension essential for longitudinal analysis. Given the original format of the exported files, each dataset required transposition so that financial measures were properly aligned with their corresponding accounts and years.Following these transformations, the individual yearly datasets were carefully merged into a single, unified dataset to serve as the foundation for the predictive modeling process. This merging process was performed with close attention to maintaining consistency across variables and ensuring that no financial records were inadvertently lost or duplicated. The resulting dataset provided a clean, structured, and comprehensive resource suitable for in-depth regression modeling and further statistical exploration.# DemographicsTo gain a deeper understanding of the underlying characteristics within the dataset, demographic summaries were created based on key financial and operational variables. In this context, "demographics" refers to the classification of accounts according to factors such as service class (commercial, industrial, medical, or university-affiliated) and fiscal year of service. These groupings allowed for the identification of patterns and differences across various types of accounts, providing essential context for interpreting financial trends over time. By summarizing the distribution of accounts across classes and years, it became possible to observe which sectors contributed most significantly to net income and how these contributions shifted annually. These demographic breakdowns were critical in informing the selection and construction of the regression models, ensuring that the models captured not just overall financial performance, but also the unique contributions of different account types and operational years to the company's financial outcomes.The pie chart titled "Number of Different Accounts from 2019--2023" provides a visual summary of the distribution of account types within the company's financial data. It reveals that commercial accounts make up the majority of the company's business, representing 279 accounts, followed by industrial (81 accounts), medical (62 accounts), and university-affiliated accounts (44 accounts). This breakdown highlights the company's heavy reliance on the commercial sector for its financial stability. Understanding the relative proportions of account types is important because it informs the regression modeling process --- specifically, it ensures that variations in net income are interpreted within the context of the underlying client demographics. Differences in service class may drive variability in financial performance, making this demographic breakdown a critical component of the overall analysis.```{r}# Load the necessary librarylibrary(plotrix)# Set the background color to light bluepar(bg ="lightblue")# Create the class counts tableclass_counts <-table(PandL$Class)# Define custom labelslabels <-c("Commercial", "Industrial", "Medical", "University")# Create labels with counts includedlabels_with_counts <-paste0(labels, "\n(", class_counts, ")")# Create the exploding 3D pie chartpie3D( class_counts,labels = labels_with_counts,explode =0.3,height =0.2,main ="Number of Different Accounts from 2019–2023",labelcex =1.1,col =rainbow(length(class_counts)))```The bar graph titled "Number of Accounts by Year" displays the distribution of accounts from 2019 to 2023. Each bar represents the total number of accounts recorded in a given year, with the exact count labeled above each bar for clarity. The number of accounts remained relatively consistent over the five-year period, ranging from 90 accounts in 2019 to 98 accounts in 2020, which marked the peak. Subsequent years showed minimal variation, with 91 accounts in 2021, 94 in 2022, and 93 in 2023.This stability in account numbers across years is important for the modeling process, as it suggests that there were no major fluctuations in client volume that could introduce bias or disproportionately influence the regression analysis. Consistent sample sizes across years support more reliable year-to-year comparisons and help ensure that any trends detected in net income are more likely attributable to operational or financial variables rather than shifts in the size of the client base.```{r}#| label: fig-accounts-per-year#| fig-cap: "Number of Accounts by Year (2019–2023)"#| echo: true#| code-fold: true # <-- collapsible# Summarize to get countsaccounts_per_year <- PandL %>%group_by(Year) %>%summarise(account_count =n()) %>%arrange(Year)# Create the bar plotggplot(accounts_per_year, aes(x =factor(Year), y = account_count)) +geom_col(fill ="steelblue", color ="black", width =0.7) +geom_text(aes(label = account_count), vjust =-0.5, size =5, fontface ="bold") +labs(title ="Number of Accounts by Year",x ="Year",y ="Number of Accounts" ) +ylim(0, 110) +theme_minimal(base_size =14) +theme(plot.title =element_text(hjust =0.5, face ="bold"),axis.text.x =element_text(angle =0, vjust =0.5),panel.grid.major.y =element_line(color ="gray80"),panel.grid.minor =element_blank(),plot.background =element_rect(fill ="lightyellow", color =NA),panel.background =element_rect(fill ="lightyellow", color =NA) )```The line graph titled "Total Net Income by Year" illustrates the overall net income generated by the company from 2019 to 2023. Net income values are plotted for each year, with specific dollar amounts labeled at each point and a dollar sign symbol marking each year's income. The y-axis represents Net Income (\$), while the x-axis displays the years in chronological order.The graph shows a clear upward trend from 2019 to 2020, where net income increased significantly from \$921,215 to a peak of \$1,326,715. After reaching this peak in 2020, net income steadily declined over the following years, dropping to \$1,169,509 in 2021, \$1,052,897 in 2022, and \$1,052,813 in 2023. While the rate of decline slowed between 2022 and 2023, the overall trend after 2020 indicates a decrease in profitability.In the context of this project, this trend is crucial for understanding financial dynamics over time and emphasizes the need for predictive modeling. The sharp increase in 2020 followed by a gradual decline suggests that specific operational factors---such as changes in payroll expenses, supply costs, or shifts in account types---may have impacted net income. By incorporating variables like year, class, payroll expenses, cleaning supplies, and paper goods into regression models, the analysis aims to identify which factors most strongly influenced these income patterns. This understanding will not only enhance prediction accuracy but also provide insights for future financial planning and decision-making within the company.```{r}#| label: fig-net-income-by-year#| fig-cap: "Total Net Income by Year (2019–2023)"#| echo: true#| code-fold: true#| warning: false ####Line graph for net income by year ###### Summarize Net Income by Yearnet_income_by_year <- PandL %>%group_by(Year) %>%summarise(Net_Income =sum(Net.Income, na.rm =TRUE)) %>%arrange(Year)# Calculate max y for paddingmax_income <-max(net_income_by_year$Net_Income, na.rm =TRUE)y_limit <- max_income *1.25# Add 25% padding# Fancy Line Chart with Dollar Signs as Pointsggplot(net_income_by_year, aes(x = Year, y = Net_Income)) +geom_line(color ="steelblue", linewidth =1.5) +# Dollar sign symbols as data pointsgeom_text(aes(label ="$"),size =8, # Adjust size as neededvjust =0.5,color ="darkgreen",fontface ="bold" ) +# Actual Net Income values above each pointgeom_text(aes(label =comma(round(Net_Income))),vjust =-2.5,size =3.5,fontface ="bold" ) +labs(title ="Total Net Income by Year",x ="Year",y ="Net Income ($)" ) +scale_y_continuous(labels =dollar_format(),limits =c(500000, y_limit) ) +theme_minimal(base_size =14) +theme(plot.title =element_text(hjust =0.5, face ="bold"),panel.background =element_rect(fill ="lightblue", color =NA),plot.background =element_rect(fill ="lightgrey", color =NA) )```The figure below shows the trend in Cleaning, Paper, and Payroll expenses from 2019 to 2023. Payroll expenses were consistently the highest, increasing steadily over the period. Cleaning expenses peaked in 2020 before declining and stabilizing, while Paper expenses remained relatively low and stable throughout the study period. Payroll expenses were consistently the highest, increasing steadily over the period. Cleaning expenses peaked in 2020 before declining and stabilizing, while Paper expenses remained relatively low and stable throughout the analysis.```{r}#| label: fig-expenses-over-time#| fig-cap: "Expenses by Category Over Time (2019–2023)"#| echo: true#| code-fold: trueggplot(expense_long, aes(x = Year, y = Amount, color = Expense_Type)) +geom_line(size =1.5) +geom_point(size =3) +geom_text(aes(label =comma(round(Amount))),vjust =-0.7,size =2.5,fontface ="bold",show.legend =FALSE ) +scale_y_continuous(labels =dollar_format(),breaks =seq(0, 1600000, by =250000) ) +labs(title ="Expenses by Category Over Time",x ="Year",y ="Total Expense ($)",color ="Expense Type" ) +theme_minimal(base_size =14) +theme(plot.title =element_text(hjust =0.5, face ="bold"),legend.position ="top",panel.background =element_rect(fill ="lightyellow", color =NA),plot.background =element_rect(fill ="lightyellow", color =NA),plot.margin =margin(5,5,5,5) # more compact )```The final demographic illustration of demographics is a 3D scatter plot titled "Predicted Net Income by Payroll and Year" illustrates the relationship between payroll expenses, year, and net income for the building services company from 2019 to 2023. Payroll expenses are plotted on the x-axis, years are ordered along the y-axis, and net income is represented on the z-axis. Each point reflects an account in a given year, with colors distinguishing between years and lines connecting observations within the same year. The plot shows a generally positive relationship between payroll expenses and net income, although the trend is not perfectly consistent across years. Net income peaked in 2020, followed by a gradual flattening in subsequent years despite continued increases in payroll expenses. This visualization highlights the importance of including both payroll and year in the predictive models and suggests the need for a flexible modeling approach to account for variability over time.```{r}#| label: Predicted Net Income by Payroll and Year#| fig-cap: "Predicted Net Income by Payroll and Year"#| echo: true#| code-fold: true##### 3D Plot of Net Income by Year by Payroll ####library(plotly)PandL$Year <-as.factor(PandL$Year)PandL$YearNum <-as.numeric(as.character(PandL$Year)) # Needed for axis orderingplot_ly( PandL, x =~Payroll.Expenses, y =~YearNum, z =~Net.Income, color =~Year,colors =c("cyan", "orange", "pink", "yellow", "purple"),type ="scatter3d", mode ="lines+markers",marker =list(size =8,line =list(color ='black', width =2) ),line =list(width =5),width =900,height =600) %>%layout(margin =list(l =60, r =60, b =50, t =50), # Tighter top marginscene =list(camera =list(eye =list(x =1.6, y =-1.6, z =1.2)), # Nicely centered viewxaxis =list(title =list(text ="Payroll Expenses", font =list(size =14)),tickfont =list(size =11),gridcolor ='grey50', # Darker grid lineszerolinecolor ='grey50'# Darker zero line ),yaxis =list(title =list(text ="Year", font =list(size =14)),tickvals =c(2019, 2020, 2021, 2022, 2023),ticktext =c("2019", "2020", "2021", "2022", "2023"),tickfont =list(size =11),gridcolor ='grey50', # Darker grid lineszerolinecolor ='grey50' ),zaxis =list(title =list(text ="Net Income", font =list(size =14)),tickfont =list(size =11),gridcolor ='grey50', # Darker grid lineszerolinecolor ='grey50' ) ),title =list(text ="Predicted Net Income by Payroll and Year",font =list(size =22),x =0.5,xanchor ="center",yanchor ="top" ),plot_bgcolor ='#e6f7ff', # Light blue inside plotpaper_bgcolor ='#e6f7ff'# Light blue outside plot )```# Methods## Initial Regression ModelA multiple linear regression was performed to predict net income based on year, payroll expenses, paper goods expenses, and cleaning supplies expenses. The model was statistically significant overall (F(7, 458) = 273.7, p \< 2.2e-16), indicating that, together, the predictors explain a substantial portion of the variance in net income. The Multiple R-squared value was 0.8071, and the Adjusted R-squared was 0.8041, suggesting that approximately 80% of the variability in net income is explained by the included predictors, a very strong model fit for financial data.Examining the individual predictors, payroll expenses had a highly significant positive association with net income (β = 0.945, p \< 0.001), meaning that as payroll expenses increase, net income tends to increase proportionally. Similarly, paper goods expenses (β = 0.735, p \< 0.001) and cleaning supplies expenses (β = 0.530, p \< 0.001) were also significantly positively associated with net income, though with slightly smaller effects compared to payroll.The year variable showed mixed results. Compared to the baseline year (2019), only 2020 showed a statistically significant increase in net income (β = 3,154, p = 0.028), while differences for 2021, 2022, and 2023 were not statistically significant, suggesting that after the sharp rise in 2020, year-to-year changes were less influential once operational costs were accounted for.The residuals of the model had a standard error of 9,747, with residuals distributed fairly symmetrically around zero, though a few larger deviations indicate the presence of some variation not fully explained by the model.Overall, the model demonstrates that operational expenses --- particularly payroll --- are strong predictors of net income in this dataset, with year effects being most notable in 2020.```{r}#| label: fig-residual-diagnostics#| fig-cap: "Residual Diagnostics for Regression Model"#| echo: true#| code-fold: true#| warning: false model <-lm(Net.Income ~ Year + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)summary(model)# Set up 2x2 plot gridpar(mfrow =c(2, 2))# Plot standard diagnostic plotsplot(model)```# Model DiagnosticsResidual analysis revealed several key diagnostic issues with the initial linear regression model. The Residuals vs Fitted plot showed that most points were centered around zero, which is desirable; however, other diagnostic plots indicated violations of core regression assumptions. The Q-Q plot exhibited heavy deviations in the tails, suggesting that the residuals were not normally distributed. The Scale-Location plot displayed an upward trend, indicating the presence of heteroscedasticity, where the variance of residuals increases with fitted values. Additionally, the Residuals vs Leverage plot identified a few high-leverage points that could disproportionately influence the model estimates. Collectively, these diagnostic findings highlighted instability in the residuals and significant departures from ideal regression conditions. As a result, further modeling adjustments were pursued to address these issues and improve the overall fit and robustness of the predictive model.### TransformationA log transformation was applied to net income to address skewness, reduce the influence of extreme values, and improve adherence to linear regression assumptions of normality and homoscedasticity.```{r}### Regression Model 2: Log Net Income Predictors#| label: tbl-model2-summary#| tbl-cap: "Summary of Linear Model 2: Predictors of Log Net Income"#| echo: true#| code-fold: true#| warning: false PandL <- PandL %>%filter(Net.Income >0)# Fit the modelmodel2 <-lm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)# Display model summarysummary(model2)par(mfrow =c(2, 2))plot(model2)```A multiple linear regression was performed to predict log-transformed net income using year, service class, payroll expenses, paper goods, and cleaning supplies. The model was statistically significant (F(8, 442) = 42.84, p \< 2.2e-16) and explained approximately 43% of the variance in log net income (Adjusted R² = 0.4265). Payroll expenses remained a strong positive predictor (p \< 0.001), while paper goods and cleaning supplies also showed significant positive effects. Net income was significantly lower in 2022 and 2023 compared to 2019 after adjusting for expenses. Service class was not a significant predictor. Although fewer observations were used due to missingness after log transformation, the model improved in meeting key regression assumptions.The diagnostic plots for the log-transformed regression model reveal both improvements and ongoing limitations when compared to the original OLS model without transformation. In the Residuals vs Fitted plot, the residuals show less dramatic fanning and fewer extreme deviations compared to the original model, suggesting that the log transformation helped reduce heteroscedasticity. However, there is still some curvature and spread in the residuals, indicating that non-linearity and non-constant variance have not been entirely eliminated. In the Q-Q plot, the residuals now align much more closely with the theoretical normal line than in the original OLS model, particularly in the center of the distribution, although some deviation persists in the tails. This suggests improved but not perfect normality of residuals after transformation.The Scale-Location plot shows that variability of the residuals across fitted values is more stabilized than before, but some increasing spread remains, especially at higher fitted values. Finally, the Residuals vs Leverage plot indicates that most points have low leverage, but a few observations still exert some influence on the model, which could continue to affect model stability.Overall, compared to the original OLS model without transformation, the log-transformed model exhibits better adherence to linear regression assumptions, particularly in terms of normality and variance homogeneity. However, some minor issues with non-linearity, residual spread, and influential points remain, suggesting that further refinement---such as moving to a robust regression model---may provide additional improvements by reducing sensitivity to these outliers.# Robust Regression AnalysisGiven persistent issues, a Robust Regression was employed to reduce the influence of outliers. Robust regression was chosen because the outliers are distorting the ordinary least square estimates, the residuals are non-normally distributed and the residual plots still indicate heteroscedasticity. The robust model used M-estimation to downweight extreme residuals.```{r}### Robust Regression Model: Log Net Income Predictors#| label: tbl-robust-model-summary#| tbl-cap: "Summary of Robust Regression Model for Log Net Income"#| echo: true#| code-fold: true#| warning: false #| message: falselibrary(MASS)PandL <- PandL %>%filter(Net.Income >0)# Fit the robust regression modelrobust_model <-rlm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)# Display summary of the robust modelsummary(robust_model)#| label: fig-robust-residual-diagnostics#| fig-cap: "Residual Diagnostics for Robust Regression Model"#| echo: true#| code-fold: true#| warning: false # Set 2x2 plotting areapar(mfrow =c(2, 2))# Plot residuals vs fitted valuesplot(fitted(robust_model), residuals(robust_model),main ="Residuals vs Fitted",xlab ="Fitted Values",ylab ="Residuals",pch =20, col ="blue")abline(h =0, col ="red", lty =2)# Normal Q-Q plotqqnorm(residuals(robust_model),main ="Normal Q-Q Plot",pch =20, col ="blue")qqline(residuals(robust_model), col ="red", lty =2)# Scale-Location plotplot(fitted(robust_model), sqrt(abs(residuals(robust_model))),main ="Scale-Location",xlab ="Fitted Values",ylab ="Sqrt(|Residuals|)",pch =20, col ="blue")# Residuals vs Leverage plot (approximate leverage from lm model)plot(hatvalues(model2), residuals(robust_model),main ="Residuals vs Leverage (Approximate)",xlab ="Leverage (from lm model)",ylab ="Residuals",pch =20, col ="blue")```The model estimates indicate that payroll expenses remained a highly significant and strong positive predictor of net income (t = 12.13), similar to the findings from previous OLS models. Paper goods expenses (t = 4.47) and cleaning supplies expenses (t = 2.21) also maintained significant positive relationships. Regarding time effects, 2022 showed a statistically significant decline in net income compared to 2019 (t = -1.99), and 2023 showed a similar negative trend (t = -1.88), although slightly less significant. The year 2020 and 2021 did not show meaningful differences from the baseline year. Service class again was not statistically significant, reinforcing earlier findings that type of account was not a major driver of net income variation once operational expenses were considered.The residual standard error for the robust model was 1.222, very close to the residual standard error observed in the log-transformed OLS model (1.228), suggesting that the robust model achieved similar fit but with the advantage of being less sensitive to influential outliers. Fifteen observations were still excluded due to missingness (likely due to log(0) issues), consistent with previous analyses.Overall, the robust regression confirms the key conclusions of the log-transformed OLS model --- payroll expenses, paper goods, and cleaning supplies are strong predictors of net income, and net income showed a modest decline in the later years of the study. However, by reducing the influence of outliers, the robust model provides greater confidence in these estimates, especially in a financial dataset where a few extreme accounts could otherwise distort results.The diagnostic plots from the robust regression model show further improvements compared to the original OLS and log-transformed OLS models. In the Residuals vs Fitted plot, the residuals are more symmetrically distributed around zero, although some slight curvature and variability remain at lower fitted values. Compared to the earlier OLS plots, the robust regression has reduced the influence of extreme residuals, making the pattern less pronounced and more consistent.The Normal Q-Q plot shows some improvement in normality relative to previous models. While minor deviations from the theoretical line persist, particularly in the tails, the central portion of the distribution aligns more closely with normal expectations. This suggests that while the residuals are not perfectly normal, they are much closer to normality under the robust modeling approach.In the Scale-Location plot, the spread of residuals across fitted values appears more stabilized than in the earlier models. Most points cluster at lower fitted values, and while some heteroscedasticity remains, particularly at the lower end, it is less extreme than previously observed. The overall pattern suggests an improvement in variance homogeneity.The Residuals vs Leverage plot indicates that, similar to previous models, most points have low leverage, but a small number of observations still show higher leverage. However, the influence of these points appears less severe, reflecting the robust regression's ability to minimize the impact of high-leverage outliers on the model's coefficients.Overall, compared to both the initial OLS model and the log-transformed OLS model, the robust regression diagnostics demonstrate better performance in terms of residual symmetry, variance stabilization, and reduced sensitivity to influential points. Although minor issues still remain, the robust model represents a meaningful refinement, offering more reliable and stable coefficient estimates for interpreting the financial relationships within the dataset.``` ``````{r}#| warning: false # Load librarieslibrary(dplyr)library(ggplot2)library(MASS)PandL <- PandL %>%filter(Net.Income >0)# Clean the dataset: drop NAs and filter positive Net IncomePandL_clean <- PandL %>%drop_na(Net.Income, Payroll.Expenses, Paper.Goods, Cleaning.Supplies, Year, Class) %>%filter(Net.Income >0) # Only keep rows where Net Income is positive# Fit the robust modelrobust_model <-rlm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies,data = PandL_clean)# Create a dataset with actual and fitted valuesPandL_filtered <- PandL_clean %>%mutate(actual_log_income =log(Net.Income),fitted_robust =predict(robust_model, newdata = PandL_clean) )# Plotggplot(PandL_filtered, aes(x = Payroll.Expenses, y = actual_log_income)) +geom_point(alpha =0.7, color ="steelblue", size =2) +geom_line(aes(y = fitted_robust), color ="firebrick", linewidth =2) +labs(title ="Predicted vs Actual (Robust Regression Model)",x ="Payroll Expenses",y ="Log of Net Income" ) +theme_minimal(base_size =14) +theme(panel.background =element_rect(fill ="grey90", color =NA),plot.background =element_rect(fill ="grey95", color =NA),panel.grid.major =element_line(color ="grey70"),panel.grid.minor =element_line(color ="grey85"),axis.line =element_line(color ="black"),plot.title =element_text(face ="bold", hjust =0.5) )```The plot demonstrates a clear positive association between Payroll Expenses and Net Income on a logarithmic scale. As payroll spending increases, companies generally achieve higher levels of profitability, suggesting that investment in human resources and labor costs may be strongly tied to financial performance. The relationship is more variable at lower payroll levels, indicating that smaller companies or organizations with modest payrolls experience a wider range of financial outcomes. In contrast, among organizations with higher payroll expenses, the relationship between payroll and profitability becomes more stable and predictable. The use of a robust regression model strengthens this conclusion by reducing the influence of extreme outliers, allowing the central trend to emerge more clearly.``` ```# Results```{r}### Comparison of Regression Models#| label: tbl-model-LS, Log-Transformed OLS, and Robust Regression Models Predicting Net Income"#| echo: true#| code-fold: true#| warning: false #| message: false# Load necessary librarieslibrary(MASS)library(modelsummary)# Load your datasetPandL <-read.csv("PandL.csv") # <-- adjust path if neededPandL <- PandL %>%filter(Net.Income >0)# Fit the modelsmodel1 <-lm(Net.Income ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)model2 <-lm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)robust_model <-rlm(log(Net.Income) ~ Year + Class + Payroll.Expenses + Paper.Goods + Cleaning.Supplies, data = PandL)# Create a list of modelsmodels <-list("Model 1: OLS (Net Income)"= model1,"Model 2: Log-Transformed OLS"= model2,"Model 3: Robust Regression"= robust_model)# Create the model comparison tablemodelsummary( models,stars =TRUE,statistic ="std.error",gof_omit ="IC|Log.Lik.",title ="Comparison of Regression Models for Net Income")```Three regression models were analyzed to predict Net Income. The initial model using untransformed Net Income showed significant effects for Year, Class, and Payroll Expenses, although the coefficients were large and difficult to interpret directly due to the scale of the outcome. The second model, with log-transformed Net Income, improved interpretability and revealed that Year and Payroll Expenses remained statistically significant predictors, while the effect of Class was no longer significant. Finally, the robust regression model confirmed the findings from the log-transformed model, with Year and Payroll Expenses showing consistent significance and effect sizes, while controlling for potential outliers. Overall, the log-transformed models provided more stable and interpretable results, suggesting that transformations and robust methods improved the reliability of the model estimates.While model performance improved with log transformation and robust techniques, residual diagnostics indicated areas needing further refinement. Future work may incorporate non-linear methods or variable interaction terms.# ConclusionMultiple modeling strategies were explored to predict Net Income, including the application of variable transformations and robust regression techniques. Transforming Net Income using a logarithmic scale helped address issues of skewness and heteroskedasticity, leading to improved linearity and model fit. In addition, robust regression was employed to minimize the influence of outliers, further strengthening the stability and interpretability of the model. Although these adjustments improved performance, residual analysis revealed that important patterns of variability and potential nonlinearity remained unaccounted for. These persistent issues suggest that additional factors beyond those currently included may play a significant role in influencing Net Income. Future research should focus on incorporating additional predictors, exploring potential interaction effects, and considering more flexible modeling approaches, such as generalized additive models or mixed-effects models, to better capture the complexity of the data.# References- Fox, J. (2016). *Applied Regression Analysis and Generalized Linear Models*. Sage Publications.- Huber, P. J. (1981). *Robust Statistics*. Wiley.------------------------------------------------------------------------## QuartoQuarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.## Running CodeWhen you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:```{r}1+1```You can add options to executable code like this```{r}#| echo: false2*2```The `echo: false` option disables the printing of code (only output is displayed).